Abstract

Detecting hidden communities from observed interactions is a classical problem. Theoretical
analysis of community detection has so far been mostly limited to models with
non-overlapping communities such as the stochastic block model. In this paper, we provide
guaranteed community detection for a family of probabilistic network models with
overlapping communities, termed the mixed membership Dirichlet model, first introduced
in Airoldi et al. (2008). This model allows for nodes to have fractional memberships in
multiple communities and assumes that the community memberships are drawn from a
Dirichlet distribution. Moreover, it contains the stochastic block model as a special case.

We propose a unified approach to learning communities in these models via tensor spectral
decomposition. Our estimator uses a low-order moment tensor of the observed network,
consisting of 3-star counts. Our learning method is based on simple linear algebraic
operations such as singular value decomposition and tensor power iterations. We provide
guaranteed recovery of community memberships and model parameters, and present
a careful finite sample analysis of our learning method. Additionally, our results match the
best known scaling requirements for the special case of the (homogeneous) stochastic block
model.

Keywords: Community detection, spectral methods, tensor methods, moment-based
estimation, mixed membership models.

1.  Introduction$^{1}$

Studying  communities  forms  an  integral  part  of  social  network  analysis.  A  community 
generally  refers  to  a  group  of  individuals  with  shared  interests  (e.g.  music,  sports),  or 
relationships  (e.g.  friends,  co-workers).  Various  probabilistic  and  non-probabilistic  network 
models  attempt  to  explain  community  formation.  In  addition,  they  also  attempt  to  quantify 
interactions  and  the  extent  of  overlap  between  different  communities,  relative  sizes  among 

1 Part of this work was done when AA and RG were visiting MSR New England. AA is supported in
part by the NSF Career award CCF-1254106, NSF Award CCF-1219234, AFOSR Award FA9550-10-1-0310
and the ARO Award W911NF-12-1-0404.
the communities, and various other network properties. Studying such community models
is  also  of  interest  in  other  domains,  e.g.  in  biological  networks. 

While  there  exists  a  vast  literature  on  community  models,  learning  these  models  is 
typically  challenging,  and  various  heuristics  such  as  Markov  Chain  Monte  Carlo  (MCMC)  or 
variational  expectation  maximization  (EM)  are  employed  in  practice.  Such  heuristics  tend 
to  be  unreliable  and  scale  poorly  for  large  networks.  On  the  other  hand,  community  models 
with  guaranteed  learning  methods  tend  to  be  restrictive.  A  popular  class  of  probabilistic 
models, termed stochastic block models, has been widely studied and enjoys strong
theoretical learning guarantees, e.g. (White et al., 1976; Holland et al., 1983; Fienberg et al.,
1985; Wang and Wong, 1987; Snijders and Nowicki, 1997; McSherry, 2001). However, these models
posit that an individual belongs to a single community, which does not hold in most real
settings (Palla et al., 2005).

In this paper, we consider a class of mixed membership community models, originally
introduced by Airoldi et al. (2008), and recently employed by Xing et al. (2010) and Gopalan
et al. (2012). This model has been shown to be effective in many real-world settings, but so
far, no learning approach exists with provable guarantees. In this paper, we provide a novel
approach for learning these models and establish regimes where the communities
can be recovered efficiently. The mixed membership community model of Airoldi et al.
(2008) has a number of attractive properties. It retains many of the convenient properties of
the  stochastic  block  model.  For  instance,  conditional  independence  of  the  edges  is  assumed, 
given  the  community  memberships  of  the  nodes  in  the  network.  At  the  same  time,  it  allows 
for  communities  to  overlap,  and  for  every  individual  to  be  fractionally  involved  in  different 
communities.  It  includes  the  stochastic  block  model  as  a  special  case  (corresponding  to 
zero  overlap  among  the  different  communities).  This  enables  us  to  compare  our  learning 
guarantees  with  existing  works  for  stochastic  block  models,  and  also  study  how  the  extent 
of  overlap  among  different  communities  affects  the  learning  performance. 

1.1.  Summary  of  Results 

We  now  summarize  the  main  contributions  of  this  paper.  We  propose  a  novel  approach  for 
learning mixed membership community models of Airoldi et al. (2008). Our approach is a
method-of-moments  estimator  and  incorporates  tensor  spectral  decomposition  techniques. 
We  provide  guarantees  for  our  approach  under  a  set  of  sufficient  conditions.  Finally,  we 
compare  our  results  to  existing  ones  for  the  special  case  of  the  stochastic  block  model,  where 
nodes  belong  to  a  single  community. 

Learning general mixed membership models: We present a unified approach for the
mixed membership model of Airoldi et al. (2008). The extent of overlap between different
communities in this model class is controlled (roughly) through a single scalar parameter,
termed the Dirichlet concentration parameter $\alpha_0 := \sum_i \alpha_i$, when the community
membership vectors are drawn from the Dirichlet distribution Dir($\alpha$). When $\alpha_0 \to 0$, the mixed
membership model degenerates to a stochastic block model. We propose a unified learning
method for the class of mixed membership models. We provide explicit scaling requirements
in terms of the extent of community overlaps (through $\alpha_0$), the network size $n$, the
number of communities $k$, and the average edge connectivity across various communities.
For instance, for the special case where $p$ is the probability of an intra-community edge,
$q$ is the probability of an inter-community edge, and the average
community sizes are equal, we require that$^{2}$

$$n = \tilde{\Omega}\big(k^2 (\alpha_0 + 1)^2\big), \qquad \frac{p - q}{\sqrt{p}} = \tilde{\Omega}\left(\frac{(\alpha_0 + 1)\, k}{n^{1/2}}\right). \tag{1}$$

Thus, we require $n$ to be large enough compared to the number of communities $k$, and the
separation $p - q$ to be large enough, so that the learning method can distinguish the different
communities. Moreover, we see that the scaling requirements become more stringent as $\alpha_0$
increases. This is intuitive since it is harder to learn communities with more overlap, and we
quantify this scaling. We also quantify the error bounds for estimating various parameters
of the mixed membership model. Lastly, we establish zero-error guarantees for support
recovery: our learning method correctly identifies (w.h.p.) all the significant memberships
of a node and also identifies the set of communities where a node does not have a strong
presence.

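Learning Stochastic Block Models and Comparison with Previous Results:
For the special case of stochastic block models ($\alpha_0 \to 0$), the scaling requirements in (1)
reduce to

$$n = \tilde{\Omega}(k^2), \qquad \frac{p - q}{\sqrt{p}} = \tilde{\Omega}\left(\frac{k}{n^{1/2}}\right). \tag{2}$$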

The above requirements match the best known bounds$^{3}$ (up to poly-log factors), and were
previously achieved by Yudong et al. (2012) via convex optimization. In contrast, we propose
an iterative non-convex approach involving tensor power iterations and linear algebraic
techniques, and obtain similar guarantees for the stochastic block model. For a detailed
comparison of learning guarantees under various methods for learning stochastic block
models, see (Yudong et al., 2012).

Thus,  we  provide  guaranteed  recovery  of  the  communities  under  the  mixed  membership 
model,  and  our  scaling  requirements  in  (1)  explicitly  incorporate  the  extent  of  community 
overlaps.  Many  real-world  networks  involve  sparse  community  memberships  and  the  total 
number  of  communities  is  typically  much  larger  than  the  extent  of  membership  of  a  single 
individual, e.g. hobbies/interests of a person, university/company networks that a person
belongs to, the set of transcription factors regulating a gene, and so on. Thus, we see that
in this regime of practical interest, where $\alpha_0 = O(1)$, the scaling requirements in (1) match
those  of  the  stochastic  block  model  in  (2)  (up  to  polylog  factors)  without  any  degradation 
in  learning  performance.  Thus,  we  establish  that  learning  community  models  with  sparse 
community  memberships  is  akin  to  learning  stochastic  block  models,  and  we  present  a 
unified  learning  approach  and  analysis  for  these  models.  To  the  best  of  our  knowledge,  this 
work is the first to establish polynomial time learning guarantees for probabilistic network
models with overlapping communities, and we provide a fast, iterative learning
approach through linear algebraic techniques and tensor power iterations.

2 The notation $\tilde{\Omega}(\cdot)$, $\tilde{O}(\cdot)$ denotes $\Omega(\cdot)$, $O(\cdot)$ up to poly-log factors.

3 There are many methods which achieve the best known scaling for $n$ in (2), but have worse scaling for
the separation $p - q$. This includes variants of the spectral clustering method, e.g. (Chaudhuri et al., 2012).
See (Yudong et al., 2012) for a detailed comparison.
1.2.  Overview  of  Techniques

We  now  describe  the  main  techniques  employed  in  our  learning  approach  and  in  establishing 
the  recovery  guarantees. 

Method of moments and subgraph counts: We propose an efficient learning algorithm
based on low-order moments, viz., counts of small subgraphs. Specifically, we employ
a third-order tensor which counts the number of 3-stars in the observed network. A 3-star
is a star graph with three leaves, and we count the occurrences of such 3-stars across
different groups of nodes. We establish that the (suitably adjusted) 3-star count tensor has a
simple relationship with the model parameters when the network is drawn from a mixed
membership community model. In particular, we propose a multi-linear transformation
(termed whitening) under which the canonical polyadic (CP) decomposition of the tensor
yields the model parameters and the community vectors. Note that the decomposition of a
general tensor into its rank-one components is referred to as its CP decomposition (Kolda
and Bader, 2009) and is in general NP-hard (Hillar and Lim, 2012). However, we reduce
our learning problem to an orthogonal symmetric tensor decomposition, for which a tractable
decomposition exists, as described below.

Tensor spectral decomposition via power iterations: Our tensor decomposition
method is based on the popular tensor power iterations, e.g. see (Anandkumar et al., 2012a).
It is a simple iterative method to compute the stable eigen-pairs of a tensor. In this paper,
we propose various modifications to the basic power method to strengthen the recovery
guarantees under perturbations. For instance, we introduce a novel adaptive deflation
technique. We optimize performance for the regime where the community overlaps are
small.

Sample analysis: We establish that our learning approach correctly recovers the model
parameters and the community memberships of all nodes under exact moments. We then
carry out a careful analysis of the empirical graph moments, computed using the network
observations. We establish tensor concentration bounds and also control the perturbation
of the various quantities used by our learning algorithm via the matrix Bernstein
inequality (Tropp, 2012, thm. 1.4) and other inequalities. We impose the scaling requirements in
(1) for various concentration bounds to hold.

1.3.  Related  Work 

Many  algorithms  provide  learning  guarantees  for  stochastic  block  models.  A  popular 
method  is  based  on  spectral  clustering  (McSherry,  2001),  where  community  memberships 
are  inferred  through  projection  onto  the  spectrum  of  the  Laplacian  matrix  (or  its  variants). 
This  method  is  fast  and  easy  to  implement  (via  singular  value  decomposition).  There  are 
many variants of this method, e.g. the work by Chaudhuri et al. (2012) employs a normalized
Laplacian matrix to handle degree heterogeneities. In contrast, the work of Yudong
et al. (2012) uses convex optimization techniques via semi-definite programming to learn
block models. For a detailed comparison of learning guarantees under various methods for
learning stochastic block models, see Yudong et al. (2012). Recently, some non-probabilistic
approaches have been introduced for overlapping community models by Arora et al. (2012)
and Balcan et al. (2012). However, their setting is considerably different from the one in this
paper. We leverage the recent developments from Anandkumar et al. (2012c,a,b) for learning
topic models and other latent variable models based on the method of moments. They
consider learning these models from second- and third-order observed moments through
linear algebraic and tensor-based techniques. We exploit the tensor power iteration method
of Anandkumar et al. (2012b) and provide additional improvements to obtain stronger recovery
guarantees. Moreover, the sample analysis is quite different in the community setting
compared to other latent variable models analyzed in the previous works.

2.  Community  Models  and  Graph  Moments 
2.1.  Community  Membership  Models 

Notation: We consider a network with $n$ nodes and let $[n] := \{1, 2, \ldots, n\}$. Let $G$ be
the $\{0, 1\}$ adjacency$^{4}$ matrix for the random network and let $G_{A,B}$ be the submatrix of $G$
corresponding to rows $A \subset [n]$ and columns $B \subset [n]$. For node $i$, let $\pi_i \in \mathbb{R}^k$ denote its
community membership vector. Define $\Pi := [\pi_1 | \pi_2 | \cdots | \pi_n] \in \mathbb{R}^{k \times n}$, and let
$\Pi_A := [\pi_i : i \in A] \in \mathbb{R}^{k \times |A|}$ denote the set of column vectors restricted to $A \subset [n]$. For a matrix $M$, let
$(M)_i$ and $(M)^i$ denote its $i$th column and row respectively. For a matrix $M$ with singular
value decomposition (SVD) $M = U D V^\top$, let $(M)_{k\text{-svd}} := U \tilde{D} V^\top$ denote the $k$-rank SVD
of $M$, where $\tilde{D}$ is limited to the top-$k$ singular values of $M$. Let $M^\dagger$ denote the Moore-Penrose
pseudo-inverse of $M$. Let $\mathbb{I}(\cdot)$ be the indicator function. We use the term high probability
to mean with probability $1 - n^{-c}$ for any constant $c > 0$.

Mixed membership model: In this model, the community membership vector $\pi_u$ at
node $u$ is a probability vector, i.e., $\sum_{i \in [k]} \pi_u(i) = 1$, for all $u \in [n]$. Given the community
membership vectors, the generation of the edges is as follows: given vectors $\pi_u$ and $\pi_v$,
the probability of an edge from$^{5}$ $u$ to $v$ is $\pi_u^\top P \pi_v$, and the edges are independently drawn.
Here, $P \in [0, 1]^{k \times k}$ and we refer to it as the community connectivity matrix. We consider the
setting of Airoldi et al. (2008), where the community vectors $\{\pi_u\}$ are i.i.d. draws from the
Dirichlet distribution, denoted by Dir($\alpha$), with parameter vector $\alpha \in \mathbb{R}^k_{>0}$. The probability
density function is given by

$$\mathbb{P}[\pi] = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} \pi_i^{\alpha_i - 1}, \qquad \pi \sim \mathrm{Dir}(\alpha), \quad \alpha_0 := \sum_i \alpha_i, \tag{3}$$

where $\Gamma(\cdot)$ is the Gamma function and the ratio of the Gamma functions serves as the
normalization constant.

Let $\hat{\alpha}$ denote the normalized parameter vector $\alpha / \alpha_0$, where $\alpha_0 := \sum_i \alpha_i$. In particular,
note that $\hat{\alpha}$ is a probability vector: $\sum_i \hat{\alpha}_i = 1$. Intuitively, $\hat{\alpha}$ denotes the relative expected
sizes of the communities (since $\mathbb{E}[n^{-1} \sum_{u \in [n]} \pi_u(i)] = \hat{\alpha}_i$). Let $\hat{\alpha}_{\max}$ be the largest entry in $\hat{\alpha}$,
and $\hat{\alpha}_{\min}$ be the smallest entry. Our learning guarantees will depend on these parameters.

The stochastic block model is a limiting case of the mixed membership model when the
Dirichlet parameter is $\alpha = \alpha_0 \cdot \hat{\alpha}$, where the probability vector $\hat{\alpha}$ is held fixed and $\alpha_0 \to 0$.

4 Our analysis can easily be extended to weighted adjacency matrices with bounded entries.

5 We consider directed networks in this paper, but note that the results also hold for undirected
community models, where $P$ is a symmetric matrix, and an edge $(u, v)$ is formed with probability $\pi_u^\top P \pi_v = \pi_v^\top P \pi_u$.




Anandkumar  Ge  Hsu  Kakade 


In this case, the community membership vectors $\pi_i$ correspond to coordinate basis vectors.
In the other extreme, when $\alpha_0 \to \infty$, the Dirichlet distribution becomes peaked around a
single point; for instance, if $\alpha_i = c$ and $c \to \infty$, the Dirichlet distribution is peaked at
$k^{-1} \cdot \vec{1}$, where $\vec{1}$ is the all-ones vector. Thus, the parameter $\alpha_0$ controls the extent of overlap
among different communities.
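
The generative model above can be simulated in a few lines. The following is a minimal sketch, assuming NumPy; the function name, the values of $p$, $q$, $k$, $n$, and the choice to drop self-loops are illustrative assumptions, not part of the model definition. It draws memberships from Dir($\alpha$), draws each directed edge independently with probability $\pi_u^\top P \pi_v$, and contrasts a small and a large concentration parameter $\alpha_0$.

```python
import numpy as np

def sample_mixed_membership_graph(n, alpha, P, seed=0):
    """Draw memberships pi_u ~ Dir(alpha) and a directed 0/1 adjacency matrix G."""
    rng = np.random.default_rng(seed)
    Pi = rng.dirichlet(alpha, size=n).T          # k x n, column u is pi_u
    edge_prob = Pi.T @ P @ Pi                    # entry (u, v) is pi_u^T P pi_v
    G = (rng.random((n, n)) < edge_prob).astype(int)
    np.fill_diagonal(G, 0)                       # assumption: no self-loops
    return Pi, G

k = 3
P = np.full((k, k), 0.05) + 0.25 * np.eye(k)     # p = 0.30 on the diagonal, q = 0.05 off it
alpha_hat = np.ones(k) / k                       # equal expected community sizes

# Small alpha_0: memberships sit close to coordinate vectors (block-model-like).
Pi_sparse, G_sparse = sample_mixed_membership_graph(500, 0.3 * alpha_hat, P)
# Large alpha_0: memberships concentrate near the uniform vector (heavy overlap).
Pi_dense, G_dense = sample_mixed_membership_graph(500, 50.0 * alpha_hat, P)

print(Pi_sparse.max(axis=0).mean())   # average largest membership, near 1
print(Pi_dense.max(axis=0).mean())    # average largest membership, near 1/k
```

The average largest membership weight is a rough proxy for overlap: it approaches 1 as $\alpha_0 \to 0$ and $1/k$ as $\alpha_0 \to \infty$ with equal community sizes.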

2.2.  Graph  Moments  Under  Mixed  Membership  Models 

Our  approach  for  learning  a  mixed  membership  community  model  relies  on  the  form  of  the 
graph moments$^{6}$ under the mixed membership model. We now describe the specific graph
moments  used  by  our  learning  algorithm  (based  on  3-star  and  edge  counts)  and  provide 
explicit  forms  for  the  moments,  assuming  draws  from  a  mixed  membership  community 
model. 

Notations: Recall that $G$ denotes the adjacency matrix, and that $G_{X,A}$ denotes the
submatrix corresponding to edges going from $X$ to $A$. Recall that $P \in [0, 1]^{k \times k}$ denotes the
community connectivity matrix. Define

$$F := \Pi^\top P^\top = [\pi_1 | \pi_2 | \cdots | \pi_n]^\top P^\top. \tag{4}$$

For a subset $A \subset [n]$ of individuals, let $F_A \in \mathbb{R}^{|A| \times k}$ denote the submatrix of $F$ corresponding
to nodes in $A$, i.e., $F_A := \Pi_A^\top P^\top$. Let Diag($v$) denote a diagonal matrix with diagonal
entries given by a vector $v$. Our learning algorithm uses moments up to the third order,
represented as a tensor. A third-order tensor $T$ is a three-dimensional array whose $(p, q, r)$-th
entry is denoted by $T_{p,q,r}$. The symbol $\otimes$ denotes the standard Kronecker product: if $u$, $v$,
$w$ are three vectors, then

$$(u \otimes v \otimes w)_{p,q,r} := u_p \cdot v_q \cdot w_r. \tag{5}$$

3-star counts: The primary quantity of interest is a third-order tensor which counts the
number of 3-stars. A 3-star is a star graph with three leaves $\{a, b, c\}$ and we refer to the
internal node $x$ of the star as its "head", and denote the structure by $x \to \{a, b, c\}$. We
partition the network into four parts and consider 3-stars such that each node in the 3-star
belongs to a different partition. Consider a partition$^{7}$ $A, B, C, X$ of the network. We count
the number of 3-stars from $X$ to $A, B, C$, and our quantity of interest is

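$$T_{X \to \{A,B,C\}} := \frac{1}{|X|} \sum_{i \in X} \left[ G_{i,A}^\top \otimes G_{i,B}^\top \otimes G_{i,C}^\top \right], \tag{6}$$

where $\otimes$ is the Kronecker product, defined in (5), and $G_{i,A}$ is the row vector supported on
the set of neighbors of $i$ belonging to set $A$. Define

$$\mu_{X \to A} := \frac{1}{|X|} \sum_{i \in X} \left[ G_{i,A}^\top \right], \qquad G^{\alpha_0}_{X,A} := \sqrt{\alpha_0 + 1}\, G_{X,A} - \left( \sqrt{\alpha_0 + 1} - 1 \right) \vec{1} \mu_{X \to A}^\top. \tag{7}$$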
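
As an illustration of (6) and (7), the sketch below computes the 3-star count tensor and the adjusted adjacency block with NumPy. It assumes the reading of (7) in which $G^{\alpha_0}_{X,A}$ is $\sqrt{\alpha_0+1}\,G_{X,A}$ minus $(\sqrt{\alpha_0+1}-1)$ times the all-ones vector times $\mu_{X \to A}^\top$; the function names, the random test graph, and the partition sizes are hypothetical.

```python
import numpy as np

def three_star_tensor(G, X, A, B, C):
    """Eq. (6): average over heads i in X of G_{i,A}^T (x) G_{i,B}^T (x) G_{i,C}^T."""
    GA, GB, GC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]
    return np.einsum("ia,ib,ic->abc", GA, GB, GC) / len(X)

def adjusted_adjacency(G, X, A, alpha0):
    """Eq. (7): mean-shifted adjacency block G^{alpha_0}_{X,A}."""
    GXA = G[np.ix_(X, A)].astype(float)
    mu = GXA.mean(axis=0)                        # mu_{X->A}, a vector of length |A|
    s = np.sqrt(alpha0 + 1.0)
    return s * GXA - (s - 1.0) * np.outer(np.ones(len(X)), mu)

rng = np.random.default_rng(0)
n = 40
G = (rng.random((n, n)) < 0.3).astype(int)       # stand-in graph, not drawn from the model
X, A, B, C = (list(range(i, i + 10)) for i in (0, 10, 20, 30))
T = three_star_tensor(G, X, A, B, C)

# Entry (a, b, c) is the fraction of heads in X adjacent to all three leaves.
a, b, c = 2, 5, 7
brute = np.mean([G[i, A[a]] * G[i, B[b]] * G[i, C[c]] for i in X])
```

Each entry of `T` is the fraction of heads in `X` that are adjacent to the three chosen leaves, which is what the brute-force value `brute` recomputes for one entry. At $\alpha_0 = 0$ the adjusted block coincides with the plain adjacency block.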

6 We interchangeably use the term first order moments for edge counts and third order moments for
3-star counts.

7 For our theoretical guarantees to hold, the partitions $A, B, C, X$ can be randomly chosen and are of
size $\Theta(n)$.
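Similarly, we define$^{8}$ adjusted third-order statistics, $T^{\alpha_0}_{X \to \{A,B,C\}}$, given by

$$T^{\alpha_0}_{X \to \{A,B,C\}} := (\alpha_0 + 1)(\alpha_0 + 2)\, T_{X \to \{A,B,C\}} + 2 \alpha_0^2\, \mu_{X \to A} \otimes \mu_{X \to B} \otimes \mu_{X \to C}$$
$$\qquad - \frac{\alpha_0 (\alpha_0 + 1)}{|X|} \sum_{i \in X} \left[ G_{i,A}^\top \otimes G_{i,B}^\top \otimes \mu_{X \to C} + G_{i,A}^\top \otimes \mu_{X \to B} \otimes G_{i,C}^\top + \mu_{X \to A} \otimes G_{i,B}^\top \otimes G_{i,C}^\top \right], \tag{8}$$

and it reduces to the (scaled version of the) 3-star count $T_{X \to \{A,B,C\}}$ defined in (6) for the
stochastic block model ($\alpha_0 \to 0$).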
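
A short sketch of the adjusted third-order statistic may help. It assumes the form of (8) with coefficients $(\alpha_0+1)(\alpha_0+2)$ on the raw 3-star tensor, $2\alpha_0^2$ on the product of means, and $-\alpha_0(\alpha_0+1)$ on the three mixed terms; the function name and the random inputs are hypothetical. The inputs are the blocks $G_{X,A}$, $G_{X,B}$, $G_{X,C}$ with rows indexed by the heads in $X$.

```python
import numpy as np

def adjusted_three_star_tensor(GA, GB, GC, alpha0):
    """Eq. (8) from the blocks G_{X,A}, G_{X,B}, G_{X,C} (rows indexed by heads in X)."""
    m = GA.shape[0]                              # |X|
    T = np.einsum("ia,ib,ic->abc", GA, GB, GC) / m
    muA, muB, muC = GA.mean(axis=0), GB.mean(axis=0), GC.mean(axis=0)
    # Two leaves taken from the same head, the third replaced by its mean.
    cross = (np.einsum("ia,ib,c->abc", GA, GB, muC)
             + np.einsum("ia,b,ic->abc", GA, muB, GC)
             + np.einsum("a,ib,ic->abc", muA, GB, GC)) / m
    return ((alpha0 + 1) * (alpha0 + 2) * T
            + 2 * alpha0**2 * np.einsum("a,b,c->abc", muA, muB, muC)
            - alpha0 * (alpha0 + 1) * cross)

rng = np.random.default_rng(0)
GA, GB, GC = (rng.integers(0, 2, size=(30, 5)).astype(float) for _ in range(3))
T_plain = np.einsum("ia,ib,ic->abc", GA, GB, GC) / 30
T_sbm = adjusted_three_star_tensor(GA, GB, GC, alpha0=0.0)   # equals 2 * T_plain
```

With `alpha0=0.0` the mixed and mean terms vanish and the result is twice the raw 3-star count, matching the remark that the statistic reduces to a scaled 3-star count for the stochastic block model.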

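Proposition 1 (Moments in Mixed Membership Model) Given partitions $A, B, C, X$
and $G^{\alpha_0}_{X,A}$ and $T^{\alpha_0}$, as in (7) and (8), normalized Dirichlet concentration vector $\hat{\alpha}$, and
$F := \Pi^\top P^\top$, where $P$ is the community connectivity matrix and $\Pi$ is the matrix of community
memberships, we have

$$\mathbb{E}[(G^{\alpha_0}_{X,A})^\top \mid \Pi_A, \Pi_X] = F_A\, \mathrm{Diag}(\hat{\alpha}^{1/2})\, \Psi_X, \tag{9}$$

$$\mathbb{E}[T^{\alpha_0}_{X \to \{A,B,C\}} \mid \Pi_A, \Pi_B, \Pi_C] = \sum_{i=1}^{k} \hat{\alpha}_i\, (F_A)_i \otimes (F_B)_i \otimes (F_C)_i, \tag{10}$$

where $(F_A)_i$ corresponds to the $i$th column of $F_A$ and $\Psi_X$ relates to the community membership
matrix $\Pi_X$ as

$$\Psi_X := \mathrm{Diag}(\hat{\alpha}^{-1/2}) \left( \sqrt{\alpha_0 + 1}\, \Pi_X - \left( \sqrt{\alpha_0 + 1} - 1 \right) \left( \frac{1}{|X|} \sum_{i \in X} \pi_i \right) \vec{1}^\top \right).$$

Moreover, we have that

$$|X|^{-1}\, \mathbb{E}_{\Pi_X}[\Psi_X \Psi_X^\top] = I. \tag{11}$$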
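
The normalization in (11) can be checked numerically. The sketch below is an illustration under assumptions rather than part of the proof: it reads $\Psi_X$ as $\mathrm{Diag}(\hat{\alpha}^{-1/2})$ applied to $\sqrt{\alpha_0+1}\,\Pi_X$ shifted by $(\sqrt{\alpha_0+1}-1)$ times the empirical mean membership, picks hypothetical Dirichlet parameters, and compares the empirical second moment for one large draw of $\Pi_X$ with the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([0.3, 0.5, 0.2])                # hypothetical Dirichlet parameters
alpha0 = alpha.sum()
alpha_hat = alpha / alpha0
m = 200_000                                      # |X|, large so the average is near its mean

Pi_X = rng.dirichlet(alpha, size=m).T            # k x |X|
s = np.sqrt(alpha0 + 1.0)
centered = s * Pi_X - (s - 1.0) * Pi_X.mean(axis=1, keepdims=True)
Psi_X = centered / np.sqrt(alpha_hat)[:, None]   # Diag(alpha_hat^{-1/2}) applied to rows

second_moment = Psi_X @ Psi_X.T / m              # close to the k x k identity
print(np.round(second_moment, 2))
```

The shift by the mean membership is what cancels the rank-one term $\alpha_0 \hat{\alpha} \hat{\alpha}^\top$ in the Dirichlet second moment, leaving $\mathrm{Diag}(\hat{\alpha})$, which the diagonal scaling then turns into the identity.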

3.  Algorithm  for  Learning  Mixed  Membership  Models 

The  simple  form  of  the  graph  moments  derived  in  the  previous  section  is  now  utilized  to 
recover the community vectors $\Pi$ and model parameters $P$, $\hat{\alpha}$ of the mixed membership
model. The method is based on the so-called tensor power method, used to obtain a tensor
decomposition. For a detailed discussion on the tensor power method, see (Anandkumar
et al., 2012b). Below, we discuss the various steps of our algorithm.

Partitioning: We first partition the data into 5 disjoint sets $A, B, C, X, Y$. The set $X$ is
employed to compute whitening matrices $W_A$, $W_B$ and $W_C$, described in detail subsequently,
the set $Y$ is employed to compute the 3-star count tensor $T^{\alpha_0}$ and sets $A, B, C$ contain the
leaves of the 3-stars under consideration. The roles of the sets can be interchanged to obtain
the community membership vectors of all the sets, as described in Algorithm 1.

8 To compute the modified moments $G^{\alpha_0}$ and $T^{\alpha_0}$, we need to know the value of the scalar $\alpha_0 := \sum_i \alpha_i$,
which is the concentration parameter of the Dirichlet distribution and is a measure of the extent of overlap
between the communities. We assume its knowledge here.
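Algorithm 1 $\{\hat{\Pi}, \hat{P}, \hat{\alpha}\} \leftarrow$ LearnMixedMembership($G$, $k$, $\alpha_0$, $N$, $\tau$)

Input: Adjacency matrix $G \in \mathbb{R}^{n \times n}$, $k$ is the number of communities, $\alpha_0 := \sum_i \alpha_i$, where
$\alpha$ is the Dirichlet parameter vector, $N$ is the number of iterations for the tensor power
method, and $\tau$ is used for thresholding the estimated community membership vectors,
specified in (42) in assumption A5. Let $A^c := [n] \setminus A$ denote the set of nodes not in $A$.
Output: Estimates of the community membership vectors $\hat{\Pi} \in \mathbb{R}^{n \times k}$, community connectivity
matrix $\hat{P} \in [0, 1]^{k \times k}$, and the normalized Dirichlet parameter vector $\hat{\alpha}$.

Partition the vertex set $[n]$ into 5 parts $X, Y, A, B, C$.

Compute moments $G^{\alpha_0}_{X,A}$, $G^{\alpha_0}_{X,B}$, $G^{\alpha_0}_{X,C}$, $T^{\alpha_0}_{Y \to \{A,B,C\}}$ using (7) and (8).

$\{\hat{\Pi}_{A^c}, \hat{\alpha}\} \leftarrow$ LearnPartitionCommunity($G^{\alpha_0}_{X,A}$, $G^{\alpha_0}_{X,B}$, $G^{\alpha_0}_{X,C}$, $T^{\alpha_0}_{Y \to \{A,B,C\}}$, $G$, $N$, $\tau$).

Interchange roles$^{9}$ of $Y$ and $A$ to obtain $\hat{\Pi}_{Y^c}$.

Define $\hat{Q}$ such that its $i$-th row is $\hat{Q}^i := (\alpha_0 + 1) \frac{\hat{\Pi}^i}{|\hat{\Pi}^i|_1} - \frac{\alpha_0}{n} \vec{1}^\top$.

Estimate $\hat{P} \leftarrow \hat{Q} G \hat{Q}^\top$. {Recall that $\mathbb{E}[G] = \Pi^\top P \Pi$ in our model. We will show that
$\hat{Q} \approx (\Pi^\dagger)^\top$.}

Return $\hat{\Pi}, \hat{P}, \hat{\alpha}$.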


Whitening: The whitening procedure attempts to convert the 3-star count tensor into
an orthogonal symmetric tensor. Consider the $k$-rank singular value decomposition (SVD)
of the modified adjacency matrix $G^{\alpha_0}$ defined in (7),

$$(|X|^{-1/2} G^{\alpha_0}_{X,A})^\top_{k\text{-svd}} = U_A D_A V_A^\top.$$

Define $W_A := U_A D_A^{-1}$, and similarly define $W_B$ and $W_C$ using the corresponding matrices
$G^{\alpha_0}_{X,B}$ and $G^{\alpha_0}_{X,C}$ respectively. Now define

$$R_{A,B} := \frac{1}{|X|} W_B^\top (G^{\alpha_0}_{X,B})^\top_{k\text{-svd}} \cdot (G^{\alpha_0}_{X,A})_{k\text{-svd}} W_A, \tag{12}$$

and similarly define $R_{AC}$. The whitened and symmetrized graph-moment tensor is now
computed as

$$T^{\alpha_0}_{Y \to \{A,B,C\}}(W_A, W_B R_{AB}, W_C R_{AC}),$$

where $T^{\alpha_0}$ is given by (8) and the above describes a multi-linear transformation of the
tensor.
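
The whitening step for one leaf set can be sketched as follows, assuming NumPy. The function name and the synthetic low-rank input standing in for $G^{\alpha_0}_{X,A}$ are assumptions for illustration. The sketch forms the rank-$k$ SVD of $|X|^{-1/2}(G^{\alpha_0}_{X,A})^\top$, builds $W_A = U_A D_A^{-1}$, and checks the defining property that $W_A$ maps the rank-$k$ second moment to the identity.

```python
import numpy as np

def whitening_matrix(G_alpha_XA, k):
    """W_A = U_A D_A^{-1} from the rank-k SVD of |X|^{-1/2} (G^{alpha_0}_{X,A})^T."""
    m = G_alpha_XA.shape[0]                      # |X|
    M = G_alpha_XA.T / np.sqrt(m)                # |A| x |X|
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    U_k, d_k, Vt_k = U[:, :k], d[:k], Vt[:k]
    M_k = (U_k * d_k) @ Vt_k                     # rank-k truncation of M
    return U_k / d_k, M_k                        # W_A has shape |A| x k

rng = np.random.default_rng(2)
k = 3
# Hypothetical stand-in for G^{alpha_0}_{X,A}: a rank-k signal plus small noise.
G_alpha_XA = rng.random((60, k)) @ rng.random((k, 50)) + 0.01 * rng.standard_normal((60, 50))
W_A, M_k = whitening_matrix(G_alpha_XA, k)

# Whitening property: W_A^T (M_k M_k^T) W_A is the k x k identity.
gram = W_A.T @ M_k @ M_k.T @ W_A
```

Dividing the columns of `U_k` by the singular values is the matrix product $U_A D_A^{-1}$ written with broadcasting. The same routine would be applied to the blocks for $B$ and $C$ before forming the symmetrization matrices.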

Tensor power method: It can be shown that the whitening procedure yields a symmetric
orthogonal tensor under exact moments. We now describe the tensor power method
to recover components of a symmetric orthogonal tensor of the form

$$T = \sum_{i \in [r]} \lambda_i\, v_i \otimes v_i \otimes v_i = \sum_{i \in [r]} \lambda_i\, v_i^{\otimes 3}, \tag{13}$$

where $r$ denotes the tensor rank and we use the notation $v_i^{\otimes 3} := v_i \otimes v_i \otimes v_i$, and the
vectors $v_i \in \mathbb{R}^d$ are orthogonal to one another. Without loss of generality, we assume that
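Procedure 1 $\{\hat{\Pi}_{A^c}, \hat{\alpha}\} \leftarrow$ LearnPartitionCommunity($G^{\alpha_0}_{X,A}$, $G^{\alpha_0}_{X,B}$, $G^{\alpha_0}_{X,C}$, $T^{\alpha_0}_{Y \to \{A,B,C\}}$,
$G$, $N$, $\tau$)

Compute rank-$k$ SVD: $(|X|^{-1/2} G^{\alpha_0}_{X,A})^\top_{k\text{-svd}} = U_A D_A V_A^\top$ and compute whitening matrices
$\hat{W}_A := U_A D_A^{-1}$. Similarly, compute $\hat{W}_B$, $\hat{W}_C$ and $\hat{R}_{AB}$, $\hat{R}_{AC}$ using (12).

Compute whitened and symmetrized tensor $T \leftarrow T^{\alpha_0}_{Y \to \{A,B,C\}}(\hat{W}_A, \hat{W}_B \hat{R}_{AB}, \hat{W}_C \hat{R}_{AC})$.

$\{\hat{\lambda}, \hat{\Phi}\} \leftarrow$ TensorEigen($T$, $\{\hat{W}_A^\top G_{i,A}^\top\}_{i \notin A}$, $N$). {$\hat{\Phi}$ is a $k \times k$ matrix with each column being
an estimated eigenvector and $\hat{\lambda}$ is the vector of estimated eigenvalues. $\{\hat{W}_A^\top G_{i,A}^\top\}_{i \notin A}$ is
the set of initialization vectors and $N$ is the number of iterations.}

$\hat{\Pi}_{A^c} \leftarrow$ Thres($\mathrm{Diag}(\hat{\lambda})^{-1} \hat{\Phi}^\top \hat{W}_A^\top G_{A^c,A}^\top$, $\tau$) and $\hat{\alpha}_i \leftarrow \hat{\lambda}_i^{-2}$, for $i \in [k]$.

Return $\hat{\Pi}_{A^c}$ and $\hat{\alpha}$.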


vectors $\{v_i\}$ are orthonormal in this case. In this case, each pair $(\lambda_i, v_i)$, for $i \in [r]$, can be
interpreted as an eigen-pair for the tensor $T$, since

$$T(I, v_i, v_i) = \sum_{j \in [r]} \lambda_j \langle v_i, v_j \rangle^2 v_j = \lambda_i v_i, \quad \forall i \in [r],$$

due to the fact that $\langle v_i, v_j \rangle = \delta_{i,j}$. Thus, the vectors $\{v_i\}_{i \in [r]}$ can be interpreted as fixed
points of the map

$$v \mapsto \frac{T(I, v, v)}{\| T(I, v, v) \|}, \tag{14}$$

where $\| \cdot \|$ denotes the spectral norm (and $\| T(I, v, v) \|$ is a vector norm), and is used
to normalize the vector $v$ in (14). Thus, a straightforward approach to computing the
orthogonal decomposition of a symmetric tensor is to iterate according to the fixed-point
map in (14) with an arbitrary initialization vector. This is referred to as the tensor power
iteration method. The simple power iteration procedure is, however, not sufficient to get
good reconstruction guarantees under empirical moments. We make some modifications
which involve (i) efficient initialization and (ii) adaptive deflation. The details are in the
full version of the paper.
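
For intuition, here is a minimal sketch of the plain tensor power iteration on an exactly orthogonal symmetric tensor, assuming NumPy. It uses random restarts and simple rank-one deflation, which is an assumption made for illustration and is not the modified procedure (efficient initialization, adaptive deflation) referred to above; all function names and parameter values are hypothetical.

```python
import numpy as np

def tensor_apply(T, v):
    """Return the vector T(I, v, v), i.e. sum over j, l of T[:, j, l] v[j] v[l]."""
    return np.einsum("ijl,j,l->i", T, v, v)

def power_method(T, n_components, n_iter=100, n_restarts=10, seed=0):
    """Plain power iteration with random restarts and rank-one deflation."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    eigvals, eigvecs = [], []
    for _ in range(n_components):
        best_val, best_vec = -np.inf, None
        for _ in range(n_restarts):
            v = rng.standard_normal(T.shape[0])
            v /= np.linalg.norm(v)
            for _ in range(n_iter):
                w = tensor_apply(T, v)
                v = w / np.linalg.norm(w)        # one step of the normalized fixed-point map
            val = v @ tensor_apply(T, v)         # eigenvalue estimate T(v, v, v)
            if val > best_val:
                best_val, best_vec = val, v
        eigvals.append(best_val)
        eigvecs.append(best_vec)
        # Deflate: subtract the recovered rank-one term before the next round.
        T = T - best_val * np.einsum("i,j,l->ijl", best_vec, best_vec, best_vec)
    return np.array(eigvals), np.array(eigvecs).T

# Build an orthogonal symmetric tensor with known components.
lam = np.array([3.0, 2.0, 1.5])
V = np.linalg.qr(np.random.default_rng(3).standard_normal((3, 3)))[0]
T = sum(l * np.einsum("i,j,l->ijl", v, v, v) for l, v in zip(lam, V.T))
lam_hat, V_hat = power_method(T, n_components=3)
```

On exact moments this recovers every pair $(\lambda_i, v_i)$, though not necessarily in order of decreasing eigenvalue, since each restart converges to whichever component dominates its starting point. Under noise, the choice of initialization and deflation matters, which is what the modifications in the full paper address.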


Reconstruction after tensor power method: When exact moments are available,
estimating the community membership vectors $\Pi$ is straightforward, once we recover all
the stable tensor eigen-pairs, since $P \leftarrow (\Pi^\top)^\dagger\, \mathbb{E}[G \mid \Pi]\, \Pi^\dagger$. However, in case of empirical
moments, we can obtain better guarantees with the following modification: the estimated
community membership vectors $\hat{\Pi}$ are further subject to thresholding so that the weak
values are set to zero. This yields better guarantees in the sparse regime of the Dirichlet
distribution. In addition, we define $\hat{Q}$ such that its $i$th row is

$$\hat{Q}^i := (\alpha_0 + 1) \frac{\hat{\Pi}^i}{|\hat{\Pi}^i|_1} - \frac{\alpha_0}{n} \vec{1}^\top,$$

based on the estimate $\hat{\Pi}$, and the matrix $\hat{P}$ is obtained as $\hat{P} \leftarrow \hat{Q} G \hat{Q}^\top$. We subsequently
establish that $\hat{Q} \Pi^\top \approx I$, under a set of sufficient conditions outlined in the next section.
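
The role of the matrix $Q$ as an approximate left inverse of $\Pi^\top$ can be illustrated numerically. The sketch below assumes that row $i$ of $Q$ is $(\alpha_0+1)$ times row $i$ of the membership matrix divided by its $\ell_1$ norm, minus $(\alpha_0/n)$ times the all-ones row. It also plugs in the true memberships in place of an estimate, with hypothetical Dirichlet parameters, so it only shows the population behavior and not the finite-sample guarantee.

```python
import numpy as np

def q_from_memberships(Pi_hat, alpha0):
    """Rows Q^i = (alpha_0 + 1) Pi^i / |Pi^i|_1 - (alpha_0 / n) 1^T for a k x n matrix."""
    k, n = Pi_hat.shape
    row_l1 = np.abs(Pi_hat).sum(axis=1, keepdims=True)
    return (alpha0 + 1.0) * Pi_hat / row_l1 - alpha0 / n

rng = np.random.default_rng(4)
alpha = np.array([0.2, 0.3, 0.5])                # hypothetical Dirichlet parameters
Pi = rng.dirichlet(alpha, size=400_000).T        # true memberships, k x n
Q = q_from_memberships(Pi, alpha.sum())

approx_identity = Q @ Pi.T                       # k x k, close to I for large n
print(np.round(approx_identity, 2))
```

The subtraction of $\alpha_0 / n$ removes the rank-one part of the Dirichlet second moment, which is why the product is close to the identity and why $Q G Q^\top$ is a sensible estimate of $P$ when $\mathbb{E}[G] = \Pi^\top P \Pi$.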
Improved support recovery estimates in homophilic models: A sub-class of
community models are those satisfying homophily. Homophily is the tendency to form edges
within the members of the same community, and has been posited as an important factor in
community formation in social networks. We describe a post-processing method in Procedure 2
for models with community connectivity matrix $P$ satisfying $P(i, i) \equiv p > P(i, j) \equiv q$
for all $i \neq j$. This yields a set of communities for each node where it has a significant presence,
and we also rule out communities for every node where the presence is not strong
enough.

Procedure 2 $\{\hat{S}\} \leftarrow$ SupportRecoveryHomophilicModels($G, k, \alpha_0, \xi, \hat{\Pi}$)

Input: Adjacency matrix $G \in \mathbb{R}^{n \times n}$, $k$ is the number of communities, $\alpha_0 := \sum_i \alpha_i$, where
$\alpha$ is the Dirichlet parameter vector, $\xi$ is the threshold for support recovery, corresponding
to significant community memberships of an individual. Get estimate $\hat{\Pi}$ from Algorithm 1.
Also assume the model is homophilic: $P(i,i) = p > P(i,j) = q$, for all $i \neq j$.

Output: $\hat{S} \in \{0,1\}^{n \times k}$ is the estimated support for significant community memberships.

Consider partitions $A, B, C, X, Y$ as in Algorithm 1.

Define $\hat{Q}$ along the lines of the definition in Algorithm 1, using the estimates $\hat{\Pi}$. Let the $i$-th row for
set $B$ be $\hat{Q}_B^i := (\alpha_0 + 1) \frac{\hat{\Pi}_B^i}{|\hat{\Pi}_B^i|_1} - \frac{\alpha_0}{n} \mathbf{1}^\top$. Similarly define $\hat{Q}_C^i$.

Estimate $\hat{F}_C \leftarrow G_{C,B} \hat{Q}_B^\top$, $\hat{P} \leftarrow \hat{Q}_C \hat{F}_C$.
if $\alpha_0 = 0$ (stochastic block model) then
  for $x \in C$ do
    Let $i^* \leftarrow \arg\max_{i \in [k]} \hat{F}_C(x, i)$ and $\hat{S}(i^*, x) \leftarrow 1$ and $0$ o.w. {Assign community with
    maximum average degree.}
  end for
else
  Let $H$ be the average of diagonals of $\hat{P}$, $L$ be the average of off-diagonals of $\hat{P}$
  for $x \in C$, $i \in [k]$ do
    $\hat{S}(i, x) \leftarrow 1$ if $\hat{F}_C(x, i) \geq L + (H - L) \cdot \frac{3\xi}{4}$ and zero otherwise. {Identify large entries}
  end for
end if

Permute the roles of the sets $A, B, C, X, Y$ to get results for remaining nodes.
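The thresholding step of the procedure above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `recover_support`, the nested-list data layout (rows indexed by nodes, columns by communities), and the parameter `frac` are assumptions made here. The default `frac = 0.75` is assumed as the midpoint between the levels $\xi/2$ and $\xi$ that appear in the support recovery guarantee; it presumes $k \geq 2$.

```python
def recover_support(F_C, P_hat, xi, alpha0, frac=0.75):
    """Threshold averaged connectivities F_C (rows: nodes, columns: communities).

    frac is an assumed constant (see lead-in); k >= 2 is assumed.
    """
    k = len(P_hat)
    if alpha0 == 0:
        # Stochastic block model: keep only the arg-max community per node.
        support = []
        for row in F_C:
            best = max(range(k), key=lambda i: row[i])
            support.append([1 if i == best else 0 for i in range(k)])
        return support
    # Mixed membership: cutoff lies between the off-diagonal average L
    # and the diagonal average H of the estimated connectivity matrix.
    H = sum(P_hat[i][i] for i in range(k)) / k
    L = sum(P_hat[i][j] for i in range(k) for j in range(k) if i != j) / (k * (k - 1))
    cutoff = L + (H - L) * frac * xi
    return [[1 if row[i] >= cutoff else 0 for i in range(k)] for row in F_C]
```

In the homogeneous model, a node with membership weight $\pi$ in community $i$ has expected averaged connectivity $q + (p - q)\pi$ to that community, which is why the cutoff is expressed relative to $L$ and $H$.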


4.  Sample  Analysis  for  Proposed  Learning  Algorithm 
4.1.  Sufficient  Conditions  and  Recovery  Guarantees 

It is easier to present the guarantees for our proposed algorithm for the special case where all
the communities have the same expected size, and the entries of the community connectivity
matrix $P$ are equal on diagonal and off-diagonal locations:

$$\widehat{\alpha}_i \equiv \frac{1}{k}, \qquad P(i,j) = p \cdot \mathbb{I}(i = j) + q \cdot \mathbb{I}(i \neq j), \qquad p > q. \tag{15}$$

In other words, the probability of an edge according to $P$ depends only on whether it is
between two individuals of the same community or between different communities. The
above setting is also well studied for stochastic block models ($\alpha_0 = 0$), allowing us to
compare our results with existing ones. The results for general mixed membership models
are available in the full version of the paper (?).

[A1] Sparse regime of Dirichlet parameters: The community membership vectors
are drawn from the Dirichlet distribution, $\mathrm{Dir}(\alpha)$, under the mixed membership model.
We assume that $\alpha_i < 1$ for $i \in [k]$, which is the sparse regime of the Dirichlet
distribution.

[A2] Condition on the network size: Given the concentration parameter of the
Dirichlet distribution, $\alpha_0 := \sum_i \alpha_i$, we require that

$$n = \tilde{\Omega}(k^2 (\alpha_0 + 1)^2), \tag{16}$$

and that the sets $A, B, C, X, Y$ in the partition are $\Theta(n)$. Note that from assumption A1,
$\alpha_i < 1$, which implies that $\alpha_0 < k$. Thus, in the worst case, when $\alpha_0 = \Theta(k)$, we require$^{10}$
$n = \tilde{\Omega}(k^4)$, and in the best case, when $\alpha_0 = \Theta(1)$, we require $n = \tilde{\Omega}(k^2)$. The latter case
includes the stochastic block model ($\alpha_0 = 0$).


[A3] Condition on edge connectivity: Recall that $p$ is the probability of
intra-community connectivity and $q$ is the probability of inter-community connectivity. We
require that

$$\frac{p - q}{\sqrt{p}} = \tilde{\Omega}\left( \frac{(\alpha_0 + 1) k}{n^{1/2}} \right). \tag{17}$$

The above condition is on the standardized separation between intra-community and
inter-community connectivity (note that $\sqrt{p}$ upper-bounds the standard deviation $\sqrt{p(1-p)}$ of a
Bernoulli random variable with parameter $p$). The condition is required to control the perturbation
in the whitened tensor (computed using observed network samples), thereby providing guarantees on the
estimated eigen-pairs through the tensor power method.


[A4] Condition on number of iterations of the power method: We assume that
the number of iterations $N$ of the tensor power method satisfies

$$N \geq C_2 \cdot \left( \log(k) + \log\log\left( \frac{p - q}{p} \right) \right), \tag{18}$$

for some constant $C_2$.

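[A5] Choice of $\tau$ for thresholding community vector estimates: The threshold $\tau$
for obtaining estimates $\hat{\Pi}$ of community membership vectors in Algorithm 1 is chosen as

$$\tau = \Theta\left( \frac{k \sqrt{\alpha_0}}{\sqrt{n}} \cdot \frac{\sqrt{p}}{p - q} \right), \qquad \alpha_0 \neq 0, \tag{19}$$

$$\tau = 0.5, \qquad \alpha_0 = 0. \tag{20}$$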

For the stochastic block model ($\alpha_0 = 0$), since $\pi_i$ is a basis vector, we can use a large
threshold. For general models ($\alpha_0 \neq 0$), $\tau$ can be viewed as a regularization parameter and
decays as $n^{-1/2}$ when other parameters are held fixed. We are now ready to state the error
bounds on the estimates of community membership vectors $\Pi$ and the block connectivity
matrix $P$. $\hat{\Pi}$ and $\hat{P}$ are the estimates computed in Algorithm 1.

$^{10}$The notation $\tilde{\Omega}(\cdot), \tilde{O}(\cdot)$ denotes $\Omega(\cdot), O(\cdot)$ up to poly-log factors.
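To get a feel for how assumptions A2, A3 and A5 interact, the following sketch evaluates their scalings for concrete parameter values. It is illustrative only: the function name `check_scalings` and the constant `c` are hypothetical, since the assumptions fix orders of magnitude (up to constants and poly-log factors) rather than exact values, and (19) is read here as $\tau = \Theta\big(k\sqrt{\alpha_0}\,\sqrt{p} / (\sqrt{n}\,(p-q))\big)$.

```python
import math

def check_scalings(n, k, alpha0, p, q, c=1.0):
    """Evaluate the scalings in A2, A3 and A5, ignoring poly-log factors.

    c is a placeholder constant (an assumption, not taken from the paper).
    """
    size_needed = c * k**2 * (alpha0 + 1) ** 2              # right-hand side of (16)
    separation = (p - q) / math.sqrt(p)                     # left-hand side of (17)
    separation_needed = c * (alpha0 + 1) * k / math.sqrt(n)  # right-hand side of (17)
    if alpha0 == 0:
        tau = 0.5                                           # (20): stochastic block model
    else:
        # (19): decays as n^{-1/2} when the other parameters are fixed.
        tau = c * k * math.sqrt(alpha0) / math.sqrt(n) * math.sqrt(p) / (p - q)
    return {
        "A2": n >= size_needed,
        "A3": separation >= separation_needed,
        "tau": tau,
    }
```

For example, `check_scalings(10000, 5, 1.0, 0.5, 0.1)` reports both conditions as satisfied for `c = 1`, while reducing `n` to 50 violates the network-size condition.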

Recall that for a matrix $M$, $(M)^i$ and $(M)_i$ denote the $i$th row and column, respectively.
We say that an event holds with high probability if it occurs with probability $1 - n^{-c}$ for
some constant $c > 0$.


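Theorem 2 (Guarantees on Estimating $P$, $\Pi$) Under A1-A5, we have w.h.p.

$$\varepsilon_{\pi,\ell_1} := \max_{i} \|\hat{\Pi}^i - \Pi^i\|_1 = \tilde{O}\left( \frac{(\alpha_0 + 1)^{3/2} \sqrt{np}}{p - q} \right),$$

$$\varepsilon_P := \max_{i,j \in [k]} |\hat{P}_{i,j} - P_{i,j}| = \tilde{O}\left( \frac{(\alpha_0 + 1)^{3/2} k \sqrt{p}}{\sqrt{n}} \right).$$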
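The two error measures in Theorem 2 are straightforward to compute when ground truth is available, for instance on synthetic data. The sketch below is an illustration under assumptions: the function names are hypothetical, matrices are nested lists, and the communities of the estimate are assumed to be already aligned with those of the truth (the estimates are only identifiable up to a permutation of communities).

```python
def membership_error(Pi_hat, Pi):
    """Largest l1 distance between corresponding rows of Pi_hat and Pi."""
    return max(
        sum(abs(a - b) for a, b in zip(row_hat, row))
        for row_hat, row in zip(Pi_hat, Pi)
    )

def connectivity_error(P_hat, P):
    """Largest entrywise absolute deviation between P_hat and P."""
    return max(
        abs(a - b)
        for row_hat, row in zip(P_hat, P)
        for a, b in zip(row_hat, row)
    )
```
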


The proofs are given in the full version of the paper (?). The main ingredients in
establishing the above result are the tensor concentration bound and the recovery
guarantees under the tensor power method. We now provide these results below.

Recall that $F_A := \Pi_A^\top P^\top$ and $\Phi = W_A^\top F_A \operatorname{Diag}(\widehat{\alpha}^{1/2})$ denotes the set of tensor
eigenvectors under exact moments, and $\hat{\Phi}$ is the set of estimated eigenvectors under empirical
moments. We establish the following guarantees.


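Lemma 3 (Perturbation bound for estimated eigen-pairs) Under the assumptions
A1-A4, the recovered eigenvector-eigenvalue pairs $(\hat{\Phi}_i, \hat{\lambda}_i)$ from the tensor power method
satisfy, with high probability and for some permutation $\theta$,

$$\max_{i \in [k]} \|\hat{\Phi}_i - \Phi_{\theta(i)}\| \leq 8 k^{-1/2} \varepsilon_T, \qquad \max_{i} \left| \hat{\lambda}_i - \widehat{\alpha}_{\theta(i)}^{-1/2} \right| \leq 5 \varepsilon_T. \tag{21}$$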


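The tensor perturbation bound $\varepsilon_T$ is given by

$$\varepsilon_T := \left\| \hat{T}^{\alpha_0}_{Y \to \{A,B,C\}}(\hat{W}_A, \hat{W}_B \hat{R}_{AB}, \hat{W}_C \hat{R}_{AC}) - \mathbb{E}\left[ T^{\alpha_0}_{Y \to \{A,B,C\}}(W_A, \tilde{W}_B, \tilde{W}_C) \mid \Pi_{A \cup B \cup C} \right] \right\| = \tilde{O}\left( \frac{(\alpha_0 + 1) k^{3/2} \sqrt{p}}{(p - q) \sqrt{n}} \right), \tag{22}$$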


where  ||T||  for  a  tensor  T  refers  to  its  spectral  norm. 
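Lemma 3 concerns eigen-pairs returned by the tensor power method. As a reminder of what such an eigen-pair is, the sketch below runs a basic power iteration $v \leftarrow T(I, v, v) / \|T(I, v, v)\|$ on a small symmetric third-order tensor stored as nested lists. It is a bare-bones variant written for illustration: the function names are hypothetical, and refinements such as multiple restarts, deflation, or any specifics of the procedure used in the paper are deliberately omitted.

```python
import math

def apply_tensor(T, v):
    """Return the vector T(I, v, v) for a symmetric k x k x k tensor T."""
    k = len(v)
    return [
        sum(T[a][b][c] * v[b] * v[c] for b in range(k) for c in range(k))
        for a in range(k)
    ]

def power_iteration(T, v0, num_iters):
    """Iterate v <- T(I, v, v) / ||T(I, v, v)||; return (eigenvector, eigenvalue)."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    v = [x / norm0 for x in v0]
    for _ in range(num_iters):
        w = apply_tensor(T, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Eigenvalue estimate is the scalar T(v, v, v).
    lam = sum(a * b for a, b in zip(apply_tensor(T, v), v))
    return v, lam

# Example: T = 2 * e1 (x) e1 (x) e1 + 1 * e2 (x) e2 (x) e2 in two dimensions.
T = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
T[0][0][0] = 2.0
T[1][1][1] = 1.0
v, lam = power_iteration(T, [0.6, 0.8], 10)
```

For an orthogonally decomposable tensor such as this one, the iteration converges to one of the component vectors, with the choice determined by the starting point; here the start favours the first component.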


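Stochastic block models ($\alpha_0 = 0$): For stochastic block models, assumptions A2 and
A3 reduce to

$$n = \tilde{\Omega}(k^2), \qquad \zeta = \Theta\left( \frac{\sqrt{p}}{p - q} \right) = \tilde{O}\left( \frac{n^{1/2}}{k} \right). \tag{23}$$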

This matches the best known scaling (up to poly-log factors), which was previously
achieved via convex optimization by Yudong et al. (2012) for stochastic block models.
However, our results in Theorem 2 do not provide zero-error guarantees as in (Yudong
et al., 2012). We strengthen our results to provide zero-error guarantees in Section 4.1.1
below and thus match the scaling of Yudong et al. (2012) for stochastic block models.
Moreover, we also provide zero-error support recovery guarantees for recovering significant
memberships of nodes in mixed membership models in Section 4.1.1.
Dependence on $\alpha_0$: The guarantees degrade as $\alpha_0$ increases, which is intuitive since
the extent of community overlap increases. The requirement for scaling of $n$ also grows as
$\alpha_0$ increases. Note that the guarantees on $\varepsilon_{\pi,\ell_1}$ and $\varepsilon_P$ can be improved by assuming a more
stringent scaling of $n$ with respect to $\alpha_0$, rather than the one specified by A2.

4.1.1.  Zero-error  guarantees  for  support  recovery 

Recall that we proposed Procedure 2 as a post-processing step to provide improved support
recovery estimates. We now provide guarantees for this method, starting with the choice of
the threshold $\xi$ for support recovery in Procedure 2.

[A6] Choice of $\xi$ for support recovery: The threshold $\xi$ in Procedure 2 satisfies

$$\xi = \Omega(\varepsilon_P),$$

where $\varepsilon_P$ is specified in Theorem 2. We now state the guarantees for support recovery.

Theorem 4 (Support recovery guarantees) Assuming A1-A6 and (15) hold, the
support recovery method in Procedure 2 has the following guarantees on the estimated support
set $\hat{S}$: with high probability,

$$\Pi(i,j) \geq \xi \Rightarrow \hat{S}(i,j) = 1 \quad \text{and} \quad \Pi(i,j) \leq \frac{\xi}{2} \Rightarrow \hat{S}(i,j) = 0, \qquad \forall i \in [k],\ j \in [n], \tag{24}$$

where $\Pi$ is the true community membership matrix.

Thus, the above result guarantees that Procedure 2 correctly recovers all the “large”
entries of $\Pi$ and also correctly rules out all the “small” entries in $\Pi$. In other words, we
can correctly infer all the significant memberships of each node and also rule out the set of
communities where a node does not have a strong presence.

The only shortcoming of the above result is that there is a gap between the “large”
and “small” values, and for an intermediate set of values (in $[\xi/2, \xi]$), we cannot guarantee
correct inferences about the community memberships. Note that this gap depends on $\varepsilon_P$, the
error in estimating the $P$ matrix. This is intuitive, since as the error $\varepsilon_P$ decreases, we can
infer the community memberships over a larger range of values.
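The statement in (24), including the gap between $\xi/2$ and $\xi$, can be made concrete with a small checker that is useful when validating an implementation on synthetic data with known memberships. This is a hypothetical helper, not part of the paper: the name `satisfies_support_guarantee` and the nested-list layout (both matrices indexed the same way) are assumptions.

```python
def satisfies_support_guarantee(Pi, S_hat, xi):
    """Check (24): entries >= xi must be marked 1, entries <= xi/2 must be marked 0.

    Entries strictly between xi/2 and xi are unconstrained.
    """
    for pi_row, s_row in zip(Pi, S_hat):
        for value, flag in zip(pi_row, s_row):
            if value >= xi and flag != 1:
                return False
            if value <= xi / 2 and flag != 0:
                return False
    return True
```

Two different support estimates can both pass this check as long as they differ only on entries falling in the intermediate range.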

For the special case of stochastic block models (i.e., $\alpha_0 \to 0$), we can improve the
above result and give a zero-error guarantee at all nodes (w.h.p.). Note that we no longer
require a threshold $\xi$ in this case, and only infer one community for each node.

Corollary 5 (Zero error guarantee for block models) Assuming A1-A5 and (15) hold,
the support recovery method in Procedure 2 correctly identifies the community memberships
for all nodes with high probability in the case of stochastic block models ($\alpha_0 \to 0$).

Thus, with the above result, we match the state-of-the-art results of Yudong et al. (2012)
for stochastic block models in terms of scaling requirements and recovery guarantees.